Web Data Management Web Search
Authors
Abstract
The World Wide Web holds dozens of billions of freely accessible documents, and that number keeps growing. One of the major issues it raises is how to search through these documents effectively and efficiently to find those that best suit a user's need. The purpose of this chapter is to describe the techniques at the core of today's search engines (such as Google, Yahoo!, Bing, or Exalead), that is, mostly keyword search in very large collections of text documents. We also briefly touch upon other techniques and research issues that may be of importance in next-generation search engines.

This chapter is organized as follows. In Section 1, we briefly recall the Web and the languages and protocols it relies upon. Most of these topics have already been covered earlier in the book, and their introduction here is mostly intended to make the present chapter self-contained. We then present in Section 2 the techniques that can be used to retrieve pages from the Web, that is, to crawl it, and to extract text tokens from them. First-generation search engines, exemplified by Altavista, mostly relied on the classical information retrieval (IR) techniques, applied to text documents, that are described in Section 3. The advent of the Web, and more generally the steady growth of document collections managed by institutions of all kinds, has led to extensions of these techniques. We address scalability issues in Section 3.3, with a focus on centralized indexing. Distributed approaches are investigated in Chapter ??. The graph structure of the Web gives rise to ranking techniques that very effectively complement information retrieval. We conclude with a brief discussion of currently active research topics about Web search in Section 5.
1 The World Wide Web

Whereas the Internet is a physical network of computers (or hosts) connected to each other from all around the world, the World Wide Web (WWW, or Web for short) is a logical collection of hyperlinked documents shared by the hosts of this network. A hyperlinked document is just a document with references to other documents of the same collection. Note that documents of the Web may refer both to static documents stored on the hard drive of some host of the Internet and to dynamic documents that are generated …
Similar resources
Analysis of users’ query reformulation behavior in Web with regard to Wholistic/analytic cognitive styles, Web experience, and search task type
Background and Aim: The basic aim of the present study is to investigate users’ query reformulation behavior with regard to wholistic-analytic cognitive styles, search task type, and experience variables in using the Web. Method: This study is an applied research using survey method. A total of 321 search queries were submitted by 44 users. Data collection tools were Riding’s Cognitive Style A...
Towards Supporting Exploratory Search over the Arabic Web Content: The Case of ArabXplore
Due to the huge amount of data published on the Web, the Web search process has become more difficult, and it is sometimes hard to get the expected results, especially when the users are less certain about their information needs. Several efforts have been proposed to support exploratory search on the web by using query expansion, faceted search, or supplementary information extracted from exte...
Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology
Due to the growth of the Web, there are many challenges in establishing a general framework for data mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, but the problem ...
Finding Community Base on Web Graph Clustering
Search Pointers organize the main part of the application on the Internet. However, because of information management hardware, the high volume of data, and word similarities across different fields, most answers to users’ questions aren’t correct. So web graph clustering and cluster placement in corresponding answers help users achieve their intended results. Community (web communit...
A Technique for Improving Web Mining using Enhanced Genetic Algorithm
World Wide Web is growing at a very fast pace and makes a lot of information available to the public. Search engines used conventional methods to retrieve information on the Web; however, the search results of these engines are still able to be refined and their accuracy is not high enough. One of the methods for web mining is evolutionary algorithms which search according to the user interests...
A New Hybrid Method for Web Pages Ranking in Search Engines
There are many algorithms for optimizing search engine results; ranking takes place according to one or more parameters such as backward links, forward links, content, and click-through rate. The quality and performance of these algorithms depend on the listed parameters. The ranking is one of the most important components of the search engine that represents the degree of the vitality...